
Conversation

chenjian2664
Contributor

@chenjian2664 chenjian2664 commented Mar 14, 2025

Description

We already prune AddFile entries based on column statistics when generating splits. This PR extends that optimization by applying pruning earlier — while reading active entries from the checkpoint file.
By filtering out irrelevant entries at this stage, we can reduce memory usage and potentially improve split generation performance, especially for large checkpoint files.
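For intuition, here is a minimal, self-contained sketch of the pruning rule (plain Java types only; the actual change works on Trino's `TupleDomain`/`Domain` and the connector's column handles, so the names below are illustrative, not the connector code):

```java
import java.util.Map;

// Toy sketch of stats-based pruning while reading checkpoint entries.
// MinMax stands in for an AddFile entry's per-column min/max statistics and
// for the query predicate's value range; names are illustrative only.
record MinMax(long min, long max) {}

final class CheckpointPruningSketch
{
    private CheckpointPruningSketch() {}

    // Returns true when an AddFile entry can be skipped, i.e. its column
    // statistics prove that no row in the file can satisfy the predicate.
    static boolean canSkip(Map<String, MinMax> fileStats, Map<String, MinMax> predicateRanges)
    {
        for (Map.Entry<String, MinMax> predicate : predicateRanges.entrySet()) {
            MinMax stats = fileStats.get(predicate.getKey());
            if (stats == null) {
                continue; // no statistics for this column: cannot prune on it
            }
            boolean overlaps = stats.min() <= predicate.getValue().max()
                    && stats.max() >= predicate.getValue().min();
            if (!overlaps) {
                return true; // disjoint ranges: the file cannot contain matching rows
            }
        }
        return false;
    }

    public static void main(String[] args)
    {
        Map<String, MinMax> fileStats = Map.of("id", new MinMax(0, 50));
        Map<String, MinMax> predicate = Map.of("id", new MinMax(100, 200));
        System.out.println(canSkip(fileStats, predicate)); // true: entry dropped before split generation
    }
}
```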

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 14, 2025
@chenjian2664 chenjian2664 marked this pull request as draft March 14, 2025 06:34
@github-actions github-actions bot added the delta-lake Delta Lake connector label Mar 14, 2025
@chenjian2664 chenjian2664 force-pushed the prune_checkpoint branch 3 times, most recently from ec3ab17 to eda4dc9 Compare March 14, 2025 07:16
@chenjian2664 chenjian2664 force-pushed the prune_checkpoint branch 3 times, most recently from 7b5cc12 to cf01677 Compare March 28, 2025 04:06
@chenjian2664 chenjian2664 marked this pull request as ready for review March 28, 2025 07:19
@chenjian2664
Contributor Author

[Upload test results: testing/trino-product-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeCreateOrReplaceTableAsSelectCompatibility.java#L66](https://github.com/trinodb/trino/pull/25311/files#annotation_33278177652)
java.sql.SQLException: Query failed (#20250328_064948_00693_c5gfv): Failed to generate splits for default.test_replace_table_with_schema_change_ritx8f6mot

Is it flaky?

@chenjian2664
Contributor Author

chenjian2664 commented Mar 28, 2025

io.trino.spi.TrinoException: Unable to parse value [1961-01-01 01:02:03] from column ts with type timestamp(3) with time zone

@chenjian2664 chenjian2664 marked this pull request as draft March 28, 2025 23:04
@chenjian2664
Contributor Author

The stats recorded in the checkpoint file contain values like '1960-01-01 01:02:03' stored as varchar, but after CREATE OR REPLACE the column type changed to timestamp with time zone, which causes the failure when deriving the min/max values.
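For what it's worth, a minimal sketch of the kind of fallback this ends up needing (method name, formatter pattern, and the UTC assumption are mine, not the actual DeltaLakeJsonFileStatistics code): try the ISO zoned form first, and fall back to the Spark-style local form when that fails.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Illustrative sketch only: parse a stats value that may be either an ISO zoned
// timestamp or Spark's default "yyyy-MM-dd HH:mm:ss" local form without a zone.
final class TimestampStatsParsingSketch
{
    private static final DateTimeFormatter SPARK_DEFAULT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSS]");

    private TimestampStatsParsingSketch() {}

    static ZonedDateTime parseStatsTimestamp(String value)
    {
        try {
            return ZonedDateTime.parse(value); // e.g. "1960-01-01T01:02:03Z"
        }
        catch (DateTimeParseException e) {
            // e.g. "1960-01-01 01:02:03": no zone recorded, so assume UTC here
            return LocalDateTime.parse(value, SPARK_DEFAULT).atZone(ZoneOffset.UTC);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(parseStatsTimestamp("1960-01-01 01:02:03"));
        System.out.println(parseStatsTimestamp("1960-01-01T01:02:03Z"));
    }
}
```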

@chenjian2664 chenjian2664 force-pushed the prune_checkpoint branch 2 times, most recently from 66879be to 76333ea Compare April 4, 2025 10:14
@chenjian2664 chenjian2664 marked this pull request as ready for review April 4, 2025 13:08
@chenjian2664 chenjian2664 changed the title from "[WIP] Prune add file entry using columns stats in Delta checkpoint iterator" to "Prune add file entry using columns stats in Delta checkpoint iterator" Apr 4, 2025
@chenjian2664 chenjian2664 requested review from ebyhr and findinpath April 4, 2025 23:48
@findinpath
Contributor

Please resolve the merge conflicts.

@findinpath findinpath requested a review from raunaqmorarka April 7, 2025 11:46
Contributor

Can we see a test in TestDeltaLakeFileOperations that showcases, at the file access level, the effectiveness of this contribution?

Contributor Author

It might be hard to observe a noticeable difference in file access, because we already prune AddFile entries using column statistics during split generation, specifically in DeltaLakeSplitManager#getSplits and FileBasedTableStatisticsProvider#getTableStatistics.
This commit applies that pruning earlier, filtering out AddFile entries that don't satisfy the predicates while reading from the checkpoint. What we can observe is that the number of add entries returned by the iterator is reduced, which the added test demonstrates.
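As a hedged sketch of what that earlier filtering boils down to (using Trino SPI predicate types with plain String column names instead of the connector's DeltaLakeColumnHandle, so not the actual CheckpointEntryIterator code): an AddFile entry can be dropped while reading the checkpoint when the combined constraint intersected with the entry's stats-derived domain is empty.

```java
import io.trino.spi.predicate.Domain;
import io.trino.spi.predicate.Range;
import io.trino.spi.predicate.TupleDomain;
import io.trino.spi.predicate.ValueSet;

import java.util.Map;

import static io.trino.spi.type.BigintType.BIGINT;

// Hedged sketch of the pruning decision; column names and values are made up.
final class StatsPruningSketch
{
    private StatsPruningSketch() {}

    public static void main(String[] args)
    {
        // Query predicate: id BETWEEN 100 AND 200 (partition and non-partition
        // constraints combined up front).
        TupleDomain<String> constraint = TupleDomain.withColumnDomains(Map.of(
                "id", Domain.create(ValueSet.ofRanges(Range.range(BIGINT, 100L, true, 200L, true)), false)));

        // Domain derived from the AddFile entry's min/max statistics: id in [0, 50].
        TupleDomain<String> fileStatsDomain = TupleDomain.withColumnDomains(Map.of(
                "id", Domain.create(ValueSet.ofRanges(Range.range(BIGINT, 0L, true, 50L, true)), false)));

        // Disjoint ranges: skip this AddFile entry before it reaches split generation.
        boolean prune = constraint.intersect(fileStatsDomain).isNone();
        System.out.println(prune); // true
    }
}
```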

@findinpath
Contributor

Please add a high-level description so that potential reviewers, now or in a few months, can understand why this contribution was added.
Currently the title of the PR states what is technically happening, but it doesn't explain what can be achieved functionally with this change.

@chenjian2664 chenjian2664 force-pushed the prune_checkpoint branch 2 times, most recently from 9f3b998 to bff16a8 Compare April 29, 2025 03:00
Member

@raunaqmorarka raunaqmorarka left a comment

Please rebase to latest master

@chenjian2664 chenjian2664 force-pushed the prune_checkpoint branch 2 times, most recently from e0bf36f to 5dd7a3f Compare May 10, 2025 10:19
@raunaqmorarka raunaqmorarka requested a review from Copilot May 12, 2025 11:24

@Copilot Copilot AI left a comment

Pull Request Overview

This pull request optimizes the Delta checkpoint iterator by pruning AddFile entries based on column statistics earlier in the processing pipeline. Key changes include:

  • Introducing early filtering using a union of partition and non-partition constraints.
  • Adding tests to verify the behavior for various statistics and constraint scenarios.
  • Adjusting API visibility and renaming variables to better represent the union of constraints.

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| spark_default_format_timestamp_timezone_statistics.json | Added test JSON file with timestamp statistics. |
| TestDeltaLakeFileStatistics.java | Added tests for parsing Spark default timestamp formats. |
| TestCheckpointEntryIterator.java | Introduced tests to validate statistic-based entry pruning. |
| DeltaLakeDomains.java | Updated to use a local variable for clarity when matching partition keys. |
| DeltaLakeJsonFileStatistics.java | Modified timestamp parsing logic to include a fallback using exception handling. |
| CheckpointEntryIterator.java | Replaced partitionConstraint with a union-based domains variable. |
| TransactionLogParser.java | Made readPartitionTimestampWithZone public to support wider usage. |
| TransactionLogAccess.java, FileBasedTableStatisticsProvider.java, DeltaLakeSplitManager.java, DeltaLakeMetadata.java | Updated to leverage the union of constraints to filter entries earlier. |
| AddFileEntry.java | Refactored statistics extraction into a static helper method. |
Comments suppressed due to low confidence (2)

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeJsonFileStatistics.java:181

  • The use of exception-driven fallback in readStatisticsTimestampWithZone may impact performance if exceptions occur frequently. Consider refactoring the logic to avoid relying on exceptions for normal control flow.
ZonedDateTime zonedDateTime;
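For illustration, one way to avoid the exception-driven control flow (a hypothetical sketch, not the PR's code): pick the formatter by inspecting the string shape instead of catching DateTimeParseException.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical alternative sketch: ISO zoned values contain a 'T' separator
// ("1960-01-01T01:02:03Z"); Spark's default stats format uses a space and no zone.
final class ShapeBasedTimestampParsing
{
    private static final DateTimeFormatter SPARK_DEFAULT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSS]");

    private ShapeBasedTimestampParsing() {}

    static ZonedDateTime parse(String value)
    {
        if (value.indexOf('T') >= 0) {
            return ZonedDateTime.parse(value);
        }
        return LocalDateTime.parse(value, SPARK_DEFAULT).atZone(ZoneOffset.UTC);
    }
}
```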

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogParser.java:186

  • The access modifier of readPartitionTimestampWithZone has been changed from package-private to public; please confirm that this change is intentional and that the broader API exposure is acceptable.
static Long readPartitionTimestampWithZone(String timestamp)

@chenjian2664 chenjian2664 force-pushed the prune_checkpoint branch 3 times, most recently from 9f644ba to 41422a9 Compare May 12, 2025 14:59

@Copilot Copilot AI left a comment

Pull Request Overview

This PR extends the optimization for pruning AddFile entries in Delta checkpoint files by applying column statistics filtering earlier in the processing pipeline. Key changes include:

  • Enhancing test coverage with new test cases for statistic-based pruning.
  • Refactoring method signatures to introduce the non-partition constraint and merging it with the partition constraint.
  • Updating logic in various modules (checkpoint iterator, split manager, metadata, etc.) to incorporate statistics-based filtering.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/checkpoint/TestCheckpointEntryIterator.java | Added new tests for early pruning via column statistics. |
| plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeSplitManager.java | Updated method signatures to handle non-partition constraints. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/util/DeltaLakeDomains.java | Adjusted partition value retrieval with an explicit null check. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java | Refactored constraint handling and inlined statistics predicate evaluation. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogAccess.java | Updated active file filtering to use intersected constraints. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java | Passed non-partition constraint to checkpoint log entry generation. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/AddFileEntry.java | Modified stats parsing to return the computed statistics predicate. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/statistics/FileBasedTableStatisticsProvider.java | Adjusted statistics predicate creation call with new parameter. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java | Updated split generation to utilize the new constraint merging logic. |
| plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java | Updated table creation and validation calls to include non-partition constraints. |

@starburstdata-automation

starburstdata-automation commented May 13, 2025

Started benchmark workflow for this PR with test type = delta_lake/sf1000_parquet_part.

Building Trino finished with status: success
Benchmark finished with status: success
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: No baseline found.
Benchmark Comparison to the closest run from Master: Report

@starburstdata-automation

starburstdata-automation commented May 13, 2025

Started benchmark workflow for this PR with test type = delta_lake/sf1000_parquet_unpart.

Building Trino finished with status: success
Benchmark finished with status: success
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: No baseline found.
Benchmark Comparison to the closest run from Master: Report

Member

@raunaqmorarka raunaqmorarka left a comment

changes lgtm
Can you do some benchmarking with large checkpoints to see whether this additional pruning has significant benefits?
At best this is saving the parsing of a few fields of AddFileEntry, so it will be good to evaluate the benefit before we add to the complexity of this code.


github-actions bot commented Jun 3, 2025

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Jun 3, 2025

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jun 24, 2025